
derivative in the backward propagation. In detail, for the weight of binarized linear layers, the common practice is to redistribute the weight to zero mean to retain representation information [199] and to apply scaling factors to minimize the quantization error [199]. The activation is binarized by the sign function without re-scaling for computational efficiency. Thus, the computation can be expressed as

$$
\text{bi-linear}(\mathbf{X}) = \alpha_w \big(\operatorname{sign}(\mathbf{X}) \otimes \operatorname{sign}(\mathbf{W} - \mu(\mathbf{W}))\big), \qquad \alpha_w = \frac{1}{n}\lVert \mathbf{W} \rVert_1, \tag{5.3}
$$

where $\mathbf{W}$ and $\mathbf{X}$ denote the full-precision weight and activation, $\mu(\cdot)$ denotes the mean value, $\alpha_w$ is the scaling factor for the weight, and $\otimes$ denotes matrix multiplication with bitwise XNOR and bitcount operations. Besides, the quantization of the activation $\mathbf{X}$ in Eq. (5.3) is set to higher bit-widths in some works to boost the performance of binarized BERT [6, 222].
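
The following PyTorch-style sketch illustrates how the forward pass of Eq. (5.3) might be realized. It is a minimal sketch, not a reference implementation: the function name bi_linear and the shapes are illustrative, and a dense matrix multiplication over {-1, +1} tensors stands in for the bitwise XNOR/bitcount kernel used in deployment.

```python
import torch

def bi_linear(X, W):
    """Minimal sketch of Eq. (5.3): forward pass of a binarized linear layer.

    X: full-precision activation of shape (N, n)
    W: full-precision weight of shape (m, n)
    A dense matmul over {-1, +1} tensors stands in for the bitwise
    XNOR/bitcount kernel used in deployment.
    """
    # Redistribute the weight to zero mean before binarization.
    W_centered = W - W.mean()
    B_w = torch.sign(W_centered)          # binary weight in {-1, +1}
    # Scaling factor alpha_w = ||W||_1 / n, with n the number of weight elements.
    alpha_w = W.abs().mean()
    # The activation is binarized by sign without re-scaling.
    B_x = torch.sign(X)
    return alpha_w * (B_x @ B_w.t())
```

During training, the sign function would typically be paired with an approximated derivative (e.g., a straight-through-style estimator) in the backward propagation, as alluded to above; the sketch omits the backward pass.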

The input data first passes through a quantized embedding layer before being fed into the transformer blocks [285, 6]. Each transformer block consists of two main components: the Multi-Head Attention (MHA) module and the Feed-Forward Network (FFN). The computation of MHA depends on queries Q, keys K, and values V, which are derived from the hidden states $\mathbf{H} \in \mathbb{R}^{N \times D}$, where N denotes the length of the sequence and D denotes the dimension of the features. For a specific transformer layer, the computation in an attention head can be expressed as

$$
\mathbf{Q} = \text{bi-linear}_Q(\mathbf{H}), \quad \mathbf{K} = \text{bi-linear}_K(\mathbf{H}), \quad \mathbf{V} = \text{bi-linear}_V(\mathbf{H}), \tag{5.4}
$$

where $\text{bi-linear}_Q$, $\text{bi-linear}_K$, and $\text{bi-linear}_V$ represent three different binarized linear layers for Q, K, and V, respectively. Then the attention score A is computed as follows:

$$
\mathbf{A} = \frac{1}{\sqrt{D}}\left(\mathbf{B}_Q \otimes \mathbf{B}_K^\top\right), \quad \mathbf{B}_Q = \operatorname{sign}(\mathbf{Q}), \quad \mathbf{B}_K = \operatorname{sign}(\mathbf{K}), \tag{5.5}
$$

where $\mathbf{B}_Q$ and $\mathbf{B}_K$ are the binarized query and key, respectively. Note that the obtained attention weight is then truncated by the attention mask, and each row in A can be regarded as a k-dimensional vector, where k is the number of unmasked elements. Then the attention weights $\mathbf{B}^s_A$ are binarized as

$$
\mathbf{B}^s_A = \operatorname{sign}(\operatorname{softmax}(\mathbf{A})). \tag{5.6}
$$
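
Putting Eqs. (5.4)-(5.6) together, a minimal sketch of the binarized attention-weight computation for one head might look as follows. It reuses the bi_linear sketch above; the weight names W_Q, W_K, W_V and the boolean mask convention (True marking positions to ignore) are illustrative assumptions, and a dense matmul again stands in for the XNOR/bitcount kernel.

```python
import math
import torch

def binarized_attention_weights(H, W_Q, W_K, W_V, attention_mask=None):
    """Sketch of Eqs. (5.4)-(5.6): binarized attention weights for one head.

    H: hidden states of shape (N, D).
    W_Q, W_K, W_V: hypothetical full-precision projection weights of shape (D, D).
    attention_mask: optional boolean tensor of shape (N, N); True marks
    positions to be ignored (an assumed convention).
    """
    # Eq. (5.4): three separate binarized linear layers for Q, K, and V.
    Q = bi_linear(H, W_Q)
    K = bi_linear(H, W_K)
    V = bi_linear(H, W_V)  # V is weighted by the attention in the full MHA module

    D = H.shape[-1]
    B_Q = torch.sign(Q)    # binarized query, Eq. (5.5)
    B_K = torch.sign(K)    # binarized key,   Eq. (5.5)
    # Attention score; a dense matmul stands in for XNOR/bitcount.
    A = (B_Q @ B_K.t()) / math.sqrt(D)
    if attention_mask is not None:
        # Truncate masked positions so each row effectively becomes a
        # k-dimensional vector over the unmasked elements.
        A = A.masked_fill(attention_mask, float("-inf"))
    # Eq. (5.6): binarize the attention weights after the softmax.
    B_s_A = torch.sign(torch.softmax(A, dim=-1))
    return B_s_A, V
```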

Despite the appealing properties of network binarization for reducing the massive number of parameters and FLOPs, BERT binarization is technically hard from an optimization perspective. As illustrated in Fig. 5.1, the performance of quantized BERT drops mildly from 32-bit down to as low as 2-bit, i.e., around 0.6% on MRPC and 0.2% on MNLI-m of the GLUE benchmark [230]. However, when reducing the bit-width to one, the performance drops sharply, i.e., 3.8% and 0.9% on the two tasks. In summary, binarization of BERT brings severe performance degradation compared with other weight bit-widths. Therefore, BERT binarization remains a challenging yet valuable task for academia and industry. This section surveys existing works and advances for binarizing pre-trained BERT models.